
fix: improve Codex parity across provider, runtime, and workspace flows#6831

Closed
VasuBansal7576 wants to merge 16 commits into aden-hive:main from VasuBansal7576:codex/pr3-codex-workspace-parity

Conversation

VasuBansal7576 (Contributor) commented Mar 27, 2026

Description

Consolidates the Codex parity work into one PR. This includes the adapter-first Codex provider integration plus the shared backend/runtime/workspace fixes that Codex was exposing more aggressively than other models.

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactoring (no functional changes)

Related Issues

Related to #6741
Related to #6817

Changes Made

  • add the adapter-first Codex provider/backend path and wire it through the existing provider/runner contracts
  • harden validation, rerun/input propagation, worker handoff, stop behavior, result visibility, and no-progress detection in the backend/runtime path
  • improve workspace parity for Codex-built agents by gating Run correctly and preserving structured run inputs for reruns
  • add Codex parity and Codex-vs-control regression coverage for the targeted flows

Testing

Ran the following tests locally:

  • cd core && uv run pytest tests/test_litellm_provider.py tests/test_config.py -q
  • cd core && uv run pytest tests/test_codex_control_parity.py tests/test_codex_parity_gate.py tests/test_codex_planning_phase.py tests/test_event_loop_node.py tests/test_queen_lifecycle_validation.py tests/test_queen_orchestrator.py tests/test_session_manager_worker_handoff.py tests/test_session_manager_worker_validation.py tests/test_validate_agent_path.py tests/test_worker_monitoring_tools.py framework/server/tests/test_api.py -q
  • cd core/frontend && npm run test -- --run src/lib/run-inputs.test.ts
  • cd core/frontend && npm run build
  • Manual testing performed in the UI on the full stacked branch

Checklist

  • My code follows the project's style guidelines
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Screenshots (if applicable)

N/A

Summary by CodeRabbit

  • New Features

    • Worker validation prevents running invalid packages.
    • Structured run inputs: prompts collect required parameters before triggering runs; run button shown only when ready.
    • Worker health monitoring: detailed health/status and issue signals in worker status.
    • Worker rerun uses recent input defaults.
    • Terminal-stop chat: users can end queen sessions with terminal choices.
  • Bug Fixes

    • Improved streaming recovery for certain LLM backends.
    • Conversation cursor now merges state instead of overwriting.
    • More robust token expiry detection and session resume behavior.

coderabbitai bot commented Mar 28, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds Codex backend support and adapters, centralizes Codex request/header handling, refactors LLM request assembly, introduces worker/package validation gating and queen suspend behavior, enhances conversation cursor/output-key handling, implements structured run inputs in the frontend, and adds extensive tests and monitoring tools.

Changes

  • Codex transport & adapter (core/framework/llm/codex_backend.py, core/framework/llm/codex_adapter.py): New modules: constants, api-base detection/normalization, header/kwargs builders, system-prompt chunking, tool-choice derivation, request hardening, empty-stream recovery, and streamed tool-call merging.
  • LLM provider refactor (core/framework/llm/litellm.py, core/framework/config.py): Centralized request/message builders; delegated Codex behavior to the adapter; replaced the hardcoded Codex api_base with builders; added Codex kwargs normalization and JSON-mode/system-prompt handling.
  • Runner & token handling (core/framework/runner/runner.py): Improved Codex token expiry detection (JWT/expires), adjusted LLM provider selection (Codex/Kimi Code branches), and uses build_codex_litellm_kwargs() for Codex setup.
  • Graph / conversation changes (core/framework/graph/conversation.py, core/framework/graph/executor.py, core/framework/graph/prompt_composer.py): Added optional output_keys to system-prompt updates, cursor merge updates (write-merge vs clobber), adjusted transition-marker tool list computation, and removed execution/node-type preamble params from the prompt composer.
  • Runtime events & execution stream (core/framework/runtime/event_bus.py, core/framework/runtime/execution_stream.py): Extended the CLIENT_INPUT_REQUESTED payload with auto_blocked/assistant_text flags; refined handler logging; session-state persistence now selectively backfills entry-node inputs from outputs.
  • Server routes & validation (core/framework/server/app.py, core/framework/server/routes_execution.py, core/framework/server/routes_sessions.py): Validates agent paths relative to the repo root; the SPA static handler avoids /api/* fallthrough; added worker-validation gating for trigger/resume/load endpoints; chat terminal-stop handling and backward-compatible stop behavior.
  • Session manager & worker validation (core/framework/server/session_manager.py, core/framework/server/queen_orchestrator.py): Subprocess-based package validation, WorkerValidationError, persisting validation results on Session, added SessionManager.suspend_queen(), removed MCP registry init, and refined planning-ask counting logic.
  • Tools: queen lifecycle & monitoring (core/framework/tools/queen_lifecycle_tools.py, core/framework/tools/worker_monitoring_tools.py): Moved health snapshot derivation into read_worker_health_snapshot(), added classify_worker_health(), integrated health signals and judge-pressure detection, added structured input parsing, preflight validation, rerun support, and a rerun tool.
  • Frontend structured run inputs (core/frontend/src/lib/run-inputs.ts, core/frontend/src/pages/workspace.tsx, core/frontend/src/lib/run-inputs.test.ts): New run-input utilities and tests; the workspace UI prompts for missing structured inputs, stores lastRunInputData, and gates Run button visibility.
  • Tools server & scaffolding (tools/coder_tools_server.py, tools/tests/test_coder_tools_server.py): Expanded package validation and behavior checks, accepts exported names or paths, discovers local tools, refines scaffold defaults, and adds richer validation/test reporting.
  • Tests added/updated (core/tests/..., many new/updated files): A large suite of new and updated tests covering Codex parity, planning-phase gating, stream recovery and merging, validation gating, monitoring tools, session/queue behaviors, and event/flow integrations.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Server as "Server (/chat)"
    participant Queen as "Queen Node"
    participant Adapter as "CodexResponsesAdapter"
    participant EventBus

    Client->>Server: POST /chat (message)
    Server->>Server: _worker_validation_error(session)
    alt validation passes
        Server->>Queen: inject_event(message, is_client_input=True)
        Queen->>Adapter: stream(messages, system, tools)
        Adapter->>Adapter: chunk_system_prompt() / build_system_messages()
        Adapter->>Adapter: harden_request_kwargs() (headers, store=False)
        Adapter-->>Queen: stream events (text/tool calls)
        Queen->>EventBus: emit CLIENT_OUTPUT_DELTA
        EventBus-->>Client: output delta
        Queen->>EventBus: emit CLIENT_INPUT_REQUESTED (auto_blocked?)
        EventBus-->>Client: input request
    else validation fails
        Server-->>Client: 409 Conflict (validation_failures)
    end
sequenceDiagram
    participant SessionManager
    participant Subprocess
    participant Validator as "validate_agent_package impl"
    participant Session
    participant Worker

    SessionManager->>Subprocess: run validate script (uv run python -c ...)
    Subprocess-->>Validator: stdout JSON report
    Validator-->>SessionManager: parsed report
    alt report valid
        SessionManager->>Worker: load/initialize worker
    else report invalid
        SessionManager-->>Session: raise WorkerValidationError / block load
    end

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Poem

🐰 I chewed through prompts and stitched the seam,

Codex chunks hop in a careful stream,
Guards for packages lock the door,
Queens can park and ask no more,
Tests bloom bright where rabbits dream.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 49.60%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The title accurately summarizes the main change: improving Codex parity across multiple system components (provider, runtime, workspace flows). It is specific, concise, and clearly communicates the primary objective without vague terms.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.


coderabbitai bot left a comment


Actionable comments posted: 16

Note

Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.

🟡 Minor comments (5)
core/tests/test_event_loop_node.py-872-878 (1)

872-878: ⚠️ Potential issue | 🟡 Minor

Assert the ordering that this regression is about.

These assertions only prove both events were emitted. If CLIENT_INPUT_REQUESTED fires before the streamed explanation, the test still passes.

🧪 Tighten the assertion
         assert output_events
         assert input_events
+        assert received.index(output_events[0]) < received.index(input_events[0])
         assert "Root cause: checkout requests are failing" in output_events[0].data["snapshot"]
         assert input_events[0].data["prompt"] == "What would you like to do next?"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/tests/test_event_loop_node.py` around lines 872 - 878, The test only
asserts both CLIENT_OUTPUT_DELTA and CLIENT_INPUT_REQUESTED exist but not their
ordering; update the test to locate the indices of the first CLIENT_OUTPUT_DELTA
and first CLIENT_INPUT_REQUESTED in received (e.g. using enumerate and next or a
helper) and assert that the CLIENT_OUTPUT_DELTA index is less than the
CLIENT_INPUT_REQUESTED index before keeping the existing checks on
output_events[0].data["snapshot"] and input_events[0].data["prompt"]; reference
the variables received, output_events, input_events and EventType to find and
compare their positions.
core/framework/server/app.py-58-63 (1)

58-63: ⚠️ Potential issue | 🟡 Minor

Trimmed input is checked, but untrimmed input is still resolved.

At Line 51-56 you sanitize agent_path into raw_path, but Line 58 uses Path(agent_path) instead of Path(raw_path). Inputs like " examples/some_agent " will still fail unexpectedly.

💡 Suggested fix
-    candidate = Path(agent_path).expanduser()
+    candidate = Path(raw_path).expanduser()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/framework/server/app.py` around lines 58 - 63, The code constructs
candidate using Path(agent_path) but earlier sanitizes the user input into
raw_path, so untrimmed inputs like "  examples/some_agent  " bypass trimming;
change the candidate initialization to use Path(raw_path) (i.e., replace
Path(agent_path) with Path(raw_path)) so all subsequent resolution and
_REPO_ROOT joining operate on the trimmed/sanitized value (ensure this change is
applied where candidate is assigned and continues to be used).
core/tests/test_session_manager_worker_handoff.py-174-177 (1)

174-177: ⚠️ Potential issue | 🟡 Minor

Replace fixed sleep with deterministic await to avoid flaky CI.

Using await asyncio.sleep(0.05) at Line 174 makes this test timing-sensitive. Prefer waiting until the mocked coroutine is actually awaited (with timeout).

💡 Suggested fix
-    await asyncio.sleep(0.05)
+    async def _wait_for_consolidate() -> None:
+        while consolidate.await_count == 0:
+            await asyncio.sleep(0)
+
+    await asyncio.wait_for(_wait_for_consolidate(), timeout=1.0)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/tests/test_session_manager_worker_handoff.py` around lines 174 - 177,
Replace the flaky fixed sleep with a deterministic wait that times out: instead
of await asyncio.sleep(0.05), use asyncio.wait_for with a short polling loop
that checks the AsyncMock state until consolidate has been awaited (e.g., poll
consolidate.await_count or consolidate.awaited flag) and raise on timeout; then
run consolidate.assert_awaited_once() and assert
queen_node.inject_event.await_count == 0. Target the existing mock names
(consolidate and queen_node.inject_event) so the test reliably waits for the
mocked coroutine to be awaited rather than sleeping.
core/framework/runtime/event_bus.py-538-539 (1)

538-539: ⚠️ Potential issue | 🟡 Minor

Keep the traceback when a handler fails.

Line 539 downgrades this to a plain message, but these exceptions are otherwise swallowed by gather(..., return_exceptions=True). Without exc_info, this becomes the only place we could have captured the failing handler's stack trace.

Suggested fix
-                except Exception as e:
-                    logger.error(f"Handler error for {event.type}: {e}")
+                except Exception:
+                    logger.exception("Handler error for %s", event.type)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/framework/runtime/event_bus.py` around lines 538 - 539, The except block
that currently does logger.error(f"Handler error for {event.type}: {e}") should
preserve the stack trace; update the handler exception logging (in the event
dispatch/handler loop in event_bus.py where the except Exception as e block is)
to include the exception info by using logger.error(..., exc_info=True) or
logger.exception(...) so the traceback is retained when a handler fails.
core/tests/test_codex_planning_phase.py-105-111 (1)

105-111: ⚠️ Potential issue | 🟡 Minor

Avoid sleeping to decide when the node is safe to stop.

Line 106 makes this test timing-sensitive: on a slow runner the block event may not have fired yet, and on a fast runner an extra loop can already start. Waiting on the actual signal is more stable.

Suggested fix
     async def shutdown_after_first_block() -> None:
-        await asyncio.sleep(0.05)
+        await asyncio.wait_for(blocked.wait(), timeout=1.0)
         node.signal_shutdown()

Add a shared event next to received, and set it from capture():

blocked = asyncio.Event()

async def capture(event: AgentEvent) -> None:
    received.append(event)
    if _client_input_counts_as_planning_ask(event):
        phase_state.planning_ask_rounds += 1
    blocked.set()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/tests/test_codex_planning_phase.py` around lines 105 - 111, The test is
timing-sensitive because shutdown_after_first_block uses asyncio.sleep; instead
create a shared asyncio.Event (e.g., blocked) alongside received, set
blocked.set() inside the capture(event: AgentEvent) callback after updating
received and phase_state.planning_ask_rounds (and after calling
_client_input_counts_as_planning_ask), and replace the sleep/wait logic in
shutdown_after_first_block with awaiting blocked.wait() before calling
node.signal_shutdown(); keep references to shutdown_after_first_block,
node.execute(ctx), capture, received, phase_state.planning_ask_rounds and
_client_input_counts_as_planning_ask to locate and update the code accordingly.
🧹 Nitpick comments (3)
core/tests/test_session_manager_worker_validation.py (1)

78-81: Avoid hard-coding subprocess arg positions in this assertion.

Lines 78-81 couple the test to the current argv layout instead of the behavior being verified. A harmless refactor of _run_validation_report_sync() can reshuffle the command and break this test even if it still invokes the right validator.

Suggested fix
-    script = captured["cmd"][4]
+    script = next(part for part in captured["cmd"] if "_validate_agent_package_impl" in str(part))
     assert "_validate_agent_package_impl" in script
     assert "validate_agent_package(agent_name)" not in script
-    assert captured["cmd"][6] == "/tmp/demo_agent"
+    assert any(part == "/tmp/demo_agent" for part in captured["cmd"])
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/tests/test_session_manager_worker_validation.py` around lines 78 - 81,
The test currently hard-codes subprocess argument positions in assertions
against captured["cmd"]; instead, search the cmd list for the validator script
name and assert presence and correct surrounding args without relying on fixed
indices. Concretely, in the test that exercises _run_validation_report_sync(),
replace assertions like captured["cmd"][4] and captured["cmd"][6] with: assert
any(arg for arg in captured["cmd"] if "_validate_agent_package_impl" in arg),
then locate that element's index via captured["cmd"].index(...) and assert the
next/expected argument equals "/tmp/demo_agent" (and assert that the literal
"validate_agent_package(agent_name)" is not present anywhere using not
any(...)). This makes the test robust to reordering while still verifying the
validator is invoked with the correct agent path.
core/framework/server/session_manager.py (1)

211-250: Persist queen-local subscription IDs before parking.

suspend_queen() only removes the subscriptions stored on Session, but create_queen() also registers queen-local listeners on the shared EventBus. Without persisting those IDs, every park/revive cycle leaks another stale listener because this method has no way to unsubscribe them here.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/framework/server/session_manager.py` around lines 211 - 250,
suspend_queen currently only clears Session.worker_handoff_sub and
Session.memory_consolidation_sub but does not remove queen-local listeners that
create_queen registers on the shared EventBus, causing listener leaks; fix by
having create_queen store the subscription IDs it registers into a new Session
attribute (e.g., session.queen_local_subs or similar) and then modify
suspend_queen to iterate over session.queen_local_subs, call
session.event_bus.unsubscribe(...) for each saved id (with the same try/except
pattern used for other subs), and then clear session.queen_local_subs (set to
None or empty) along with the other session fields so all queen-local listeners
are unsubscribed when parking.
core/framework/tools/queen_lifecycle_tools.py (1)

1059-1078: Resolve the same entry point you trigger later.

This helper assumes entry_points[0] is the default entry, but the execution paths below all call runtime.trigger(entry_point_id="default", ...). Looking up that same entry spec here would make input shaping and reruns safe for multi-entry agents.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/framework/tools/queen_lifecycle_tools.py` around lines 1059 - 1078, The
helper _get_default_entry_input_keys currently picks entry_points[0] but
runtime.trigger uses entry_point_id="default", so change the lookup to find the
entry spec whose identifier/name equals "default" (e.g., check attributes like
id or name on each spec) and fall back to the first spec if not found; then
derive entry_node_id from that matched entry_spec (or graph.entry_node if
missing) and use graph.get_node to return its input_keys (preserving existing
None/empty guards). This ensures _get_default_entry_input_keys resolves the same
entry point used by runtime.trigger.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@core/framework/graph/conversation.py`:
- Around line 354-363: The update_system_prompt behavior should clear the
previous node contract when the next node has no outputs; change
update_system_prompt so it sets self._output_keys to the provided output_keys
and if output_keys is None explicitly clear the contract by setting
self._output_keys = [] (rather than leaving it untouched). Update the logic in
update_system_prompt to reference the parameter output_keys and the attribute
_output_keys accordingly so a transition that passes None will remove prior
output keys.
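The clearing semantics this prompt asks for can be sketched in a few lines. This is illustrative only: the class name and surrounding state are placeholders, and only `update_system_prompt` / `_output_keys` mirror the names in the prompt.

```python
class ConversationSketch:
    """Illustrates clear-on-None semantics for the output-key contract."""

    def __init__(self) -> None:
        self.system_prompt = ""
        self._output_keys: list[str] = []

    def update_system_prompt(self, prompt: str, output_keys=None) -> None:
        self.system_prompt = prompt
        # None means "the next node has no output contract": clear the
        # previous node's keys instead of leaving them untouched.
        self._output_keys = list(output_keys) if output_keys else []
```

The key point is the unconditional assignment: a transition that passes `output_keys=None` removes the prior node's contract rather than silently carrying it forward.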

In `@core/framework/llm/codex_adapter.py`:
- Around line 145-156: The harden_request_kwargs method currently uses
cleaned.setdefault("store", False) which allows an explicit store=True in
incoming kwargs to persist; change it to forcefully disable storage by assigning
cleaned["store"] = False (or removing any existing "store" and then setting it
to False) inside harden_request_kwargs so that Codex traffic cannot be
persisted; update references in harden_request_kwargs to ensure "store" is
always False regardless of incoming kwargs.
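The setdefault-vs-assignment distinction is the whole fix here. A minimal sketch (the real `harden_request_kwargs` does more cleaning than this):

```python
def harden_request_kwargs(kwargs: dict) -> dict:
    """Force-disable response storage for Codex traffic.

    cleaned.setdefault("store", False) would let a caller's explicit
    store=True survive; plain assignment cannot be overridden.
    """
    cleaned = dict(kwargs)  # never mutate the caller's dict
    cleaned["store"] = False
    return cleaned
```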

In `@core/framework/llm/codex_backend.py`:
- Around line 17-29: The current is_codex_api_base uses substring matching
(CODEX_API_BASE in api_base) which is unsafe; update is_codex_api_base(api_base:
str | None) to parse the URL (use urllib.parse.urlparse), validate scheme is
http/https, compare parsed.scheme + "://" + parsed.netloc equals the known
CODEX_API_ORIGIN (or derive origin from CODEX_API_BASE by parsing) and ensure
the parsed.path startswith the expected base path (e.g., "/v1" or the path
portion of CODEX_API_BASE) rather than using substring membership; ensure it
returns False for invalid URLs or different origins. Then update
normalize_codex_api_base to call the new is_codex_api_base and trim a trailing
"/responses" only when the parsed path ends with "/responses" and the
origin/path match, returning the reconstructed normalized origin+path (or None)
and preserving behavior for None inputs and malformed URLs by returning the
original input or None.

In `@core/framework/llm/litellm.py`:
- Around line 1249-1282: The normalization currently replaces true/false/null
inside quoted strings and the fenced-block stripping leaves language tags,
breaking parsing; update _normalize_pythonish_tool_arguments to only replace
unquoted literals (e.g., detect matches not inside single or double quotes —
either via a regex with lookarounds or by scanning the string and skipping
quoted spans) so "'null hypothesis'" remains unchanged, and change the
fenced-code removal in _parse_tool_call_arguments to strip an optional language
tag and optional newline (e.g., remove a leading pattern like ^```[^\n]*\n? and
a trailing ```), ensuring fenced JSON like ```json\n{...}\n``` parses correctly.
Reference functions: _normalize_pythonish_tool_arguments,
_parse_pythonish_tool_arguments, and _parse_tool_call_arguments.
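Both fixes can be sketched together. The function names echo the prompt, but the bodies are illustrative, not the module's actual code: a regex alternation that matches quoted spans first (so literals inside strings pass through untouched), and fence stripping that tolerates a language tag.

```python
import re

_LITERALS = {"True": "true", "False": "false", "None": "null"}
# Quoted spans are matched first, so True/False/None inside strings survive.
_TOKEN_RE = re.compile(
    r"""("(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*')|\b(True|False|None)\b"""
)

def normalize_pythonish_literals(text: str) -> str:
    return _TOKEN_RE.sub(
        lambda m: m.group(1) if m.group(1) else _LITERALS[m.group(2)],
        text,
    )

def strip_code_fence(text: str) -> str:
    # Drop an optional leading fence (with language tag) and trailing fence.
    text = text.strip()
    text = re.sub(r"^```[^\n]*\n?", "", text)
    return re.sub(r"\n?```$", "", text)
```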

In `@core/framework/server/routes_execution.py`:
- Around line 514-516: The stop endpoint should not be blocked by package
validation: remove or bypass the call to _worker_validation_error in the stop
handler so that the /stop route always proceeds to cancel/terminate the running
job even when validation fails; specifically, locate the stop route handler
where validation_err = await _worker_validation_error(session) is invoked and
either delete that check or add a conditional so that _worker_validation_error
is skipped for the "/stop" handler, ensuring the handler still performs its
existing cancellation logic and returns the appropriate stop response.

In `@core/framework/server/session_manager.py`:
- Around line 119-134: The subprocess.run call in validate_agent_package (the
block invoking ["uv", "run", "python", "-c", script, ...]) can raise
FileNotFoundError, subprocess.TimeoutExpired, or OSError and should be wrapped
in a try/except that catches those exceptions and returns the same structured
validation failure dict used for non-zero exit codes: {"valid": False,
"summary": f"validate_agent_package failed for '{agent_ref_str}'", "steps":
{"validator_subprocess": {"passed": False, "error":
<message_truncated_to_2000_chars>}}}; use str(exception) (or include timeout /
errno details) truncated to 2000 chars for the error field and keep the existing
call to _parse_validation_report(proc.stdout) when the subprocess completes
successfully.
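A hedged sketch of that wrapping: the report shape and exception list come from the prompt, while `_parse_validation_report` is assumed to exist in the real module and is stubbed here as plain JSON parsing.

```python
import json
import subprocess

def _parse_validation_report(stdout: str) -> dict:
    # Stand-in for the real parser; the module's version does more checking.
    return json.loads(stdout)

def run_validation_report(cmd: list, agent_ref_str: str, timeout: float = 120.0) -> dict:
    def _failure(error: str) -> dict:
        return {
            "valid": False,
            "summary": f"validate_agent_package failed for '{agent_ref_str}'",
            "steps": {
                "validator_subprocess": {"passed": False, "error": error[:2000]},
            },
        }

    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except (FileNotFoundError, subprocess.TimeoutExpired, OSError) as exc:
        # Spawn and timeout failures get the same structured shape as a
        # non-zero exit, so callers never see a raised exception here.
        return _failure(str(exc))
    if proc.returncode != 0:
        return _failure(proc.stderr)
    return _parse_validation_report(proc.stdout)
```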

In `@core/framework/tools/queen_lifecycle_tools.py`:
- Around line 4154-4179: rerun_worker_with_last_input currently builds
input_data from allowed_keys and thus drops legacy task-only payloads when an
entry node has no structured input_keys; change the assembly logic so that when
allowed_keys is empty or does not include legacy keys, you merge any existing
legacy fields from _load_recent_worker_input_defaults(runtime, allowed_keys)
(notably "task" and "user_request") into input_data before calling
runtime.trigger; update the code around allowed_keys, input_data, and the call
sites of _load_recent_worker_input_defaults, _normalize_worker_input_value, and
runtime.trigger so legacy "task"/"user_request" values are preserved for older
workers.
- Around line 989-1016: The helper currently scans all sessions via
sessions_dir.glob("session_*/state.json") and can leak inputs across unrelated
runs; restrict the search to the same run/session instead of global scan by
adding and using a run/session identifier (e.g., current_run_id or
current_session_id) and only process state files whose raw_state contains a
matching "run_id"/"session_id" (or whose parent directory matches the current
session identifier) before merging inputs; update the code that iterates
state_path and the caller of this function to accept the current run/session id,
and keep the existing logic around merged_input, work_keys, best_score,
best_updated_at and best_payload unchanged so only same-run state.json files are
considered.
- Around line 3863-3875: The code currently validates the new package after the
current worker has already been unloaded; move the validation and import
precheck to run before removing the active worker so a failed validation doesn't
leave the system without a live worker. Specifically, call
_run_package_validation(str(resolved_path)) and check
_validation_blocks_stage_or_run(validation_report) and _validation_failures(...)
(and the subsequent import precheck) before invoking whatever routine currently
unloads the active worker, and only proceed to unload and replace the worker
after these checks succeed.

In `@core/framework/tools/worker_monitoring_tools.py`:
- Around line 103-110: When resolving session_id (the logic around
session_id/default_session_id and sessions_dir) ensure you validate that the
selected session directory actually contains persisted worker state and not an
empty/stale folder; if the resolved (explicit or default) session dir is missing
required state files (the same checks used later for
session_status/health_status), return an error dict (e.g. {"error": "No
persisted worker state for session <id>"} ) instead of allowing session_status
to fall through to "unknown"/"healthy". Update the branch that picks session_id
(and the equivalent logic in the later block around lines 128-202) to perform
this existence/content check on (sessions_dir / session_id) and bail out with an
error when the directory is empty or lacks expected state artifacts.

In `@core/frontend/src/lib/run-inputs.ts`:
- Around line 41-46: The canShowRunButton predicate (canShowRunButton) currently
returns true when sessionId/ready/queenPhase are set, which exposes the Run
button before the workspace's graph/topology (structured input requirements)
finishes loading; update the predicate to also require a graph/topology-ready
flag (e.g., add a new parameter like graphReady or structuredInputsLoaded) and
include it in the Boolean check so the Run button only appears after
RUNNABLE_PHASES.has(queenPhase) && graphReady; alternatively, if you prefer not
to change the UI predicate, modify handleRun to call
getStructuredRunInputKeys(...) and bail (or prompt for inputs) when keys are
missing until graph metadata is loaded—reference canShowRunButton, handleRun,
and getStructuredRunInputKeys to locate where to add the additional readiness
check.

In `@core/frontend/src/pages/workspace.tsx`:
- Around line 2888-2915: The code sets lastRunInputData before
executionApi.trigger resolves, so on non-424 failures the questionnaire is lost;
change behavior to only update lastRunInputData inside the successful .then(...)
handler (i.e., remove lastRunInputData from the initial updateAgentState call)
or, if you prefer to keep the initial optimistic update, restore the
questionnaire in the .catch(...) non-424 branch by calling
updateAgentState(activeWorker, { pendingQuestions: /* original questions */ ,
pendingQuestion: /* original */, pendingOptions: /* original */,
lastRunInputData: null }) so failed validation doesn't clear the widget; update
references to updateAgentState, executionApi.trigger, lastRunInputData,
pendingQuestions and ensure appendSystemMessage and the 424 handling
(setCredentialAgentPath, setCredentialsOpen) remain unchanged.
- Around line 741-759: The cached run input (state.lastRunInputData) is
forwarded whole into inputData even when some structured keys were removed or
renamed; update the assignment for inputData to trim/filter
state.lastRunInputData to only include keys in requiredRunKeys (when
requiredRunKeys.length > 0) so stale fields are not sent to the worker—use
hasAllStructuredRunInputs to gate prompting as before, and when building/passing
inputs (inputData) create a new object that picks only requiredRunKeys from
state.lastRunInputData (falling back to {} if no requiredRunKeys or no cached
data) so updateAgentState and subsequent worker calls receive only the current
schema fields.

In `@tools/coder_tools_server.py`:
- Around line 2124-2143: The tests block incorrectly marks failing or errored
runs as passed; update the logic in the tests result handling (where
steps["tests"] is populated) so that when "error" in test_result OR when not
all_passed (computed from test_result.get("failed"/"errors")), set
steps["tests"]["passed"] = False (and still populate warning/warnings/failures
as currently done) instead of True; ensure validate_agent_package() will then
see the correct failed status by using the updated passed flag.
- Around line 510-524: The current classifier treats the "identity_prompt is
blank..." finding as non-blocking, letting agents with only an identity prompt
placeholder pass; update the blocking marker set in tools/coder_tools_server.py
by adding the exact emitted phrase or a robust substring such as
"identity_prompt is blank" (or "identity_prompt" if broader) to blocking_markers
so that when _behavior_validation_errors() yields that finding it is appended to
blocking instead of warnings.
- Line 2306: The current assignment for intro_message inserts the raw result of
_default_intro_message(human_name, _draft_desc) into a quoted Python literal,
which breaks when the draft contains quotes or newlines; replace that raw
interpolation by emitting a properly escaped Python string literal (e.g., use
repr(...) or json.dumps(...) on the _default_intro_message(...) result) so
intro_message receives a safe, escaped string; update the code that builds the
config text where intro_message: str = "{_default_intro_message(human_name,
_draft_desc)}" is produced to instead call repr/_default_escape on
_default_intro_message(human_name, _draft_desc) before embedding.

---

Minor comments:
In `@core/framework/runtime/event_bus.py`:
- Around line 538-539: The except block that currently does
logger.error(f"Handler error for {event.type}: {e}") should preserve the stack
trace; update the handler exception logging (in the event dispatch/handler loop
in event_bus.py where the except Exception as e block is) to include the
exception info by using logger.error(..., exc_info=True) or
logger.exception(...) so the traceback is retained when a handler fails.
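The traceback-preserving change above is a one-line swap; this minimal repro (logger name and event type are placeholders for the `event_bus.py` dispatch loop) shows that `logger.exception` keeps `exc_info` on the record:

```python
# Minimal illustration: logger.exception == logger.error(..., exc_info=True),
# so the handler's traceback is retained on the log record.
import logging

records = []

class _Capture(logging.Handler):
    def emit(self, record):
        records.append(record)

logger = logging.getLogger("event_bus_demo")
logger.addHandler(_Capture())
logger.setLevel(logging.ERROR)

def dispatch(handler, event_type):
    try:
        handler()
    except Exception:
        logger.exception("Handler error for %s", event_type)

def bad_handler():
    raise ValueError("boom")

dispatch(bad_handler, "CLIENT_OUTPUT_DELTA")
```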

In `@core/framework/server/app.py`:
- Around line 58-63: The code constructs candidate using Path(agent_path) but
earlier sanitizes the user input into raw_path, so untrimmed inputs like " 
examples/some_agent  " bypass trimming; change the candidate initialization to
use Path(raw_path) (i.e., replace Path(agent_path) with Path(raw_path)) so all
subsequent resolution and _REPO_ROOT joining operate on the trimmed/sanitized
value (ensure this change is applied where candidate is assigned and continues
to be used).
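A small repro of the trimming gap described above; `raw_path` mirrors the sanitized value and `agent_path` the untrimmed user input (pure-path objects are used so the sketch is platform-independent):

```python
# Resolving the untrimmed input keeps the surrounding whitespace in the path,
# so all downstream joins/resolution must start from the sanitized value.
from pathlib import PurePosixPath

agent_path = "  examples/some_agent  "
raw_path = agent_path.strip()

untrimmed_candidate = PurePosixPath(agent_path)  # whitespace survives
candidate = PurePosixPath(raw_path)              # operates on trimmed value
```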

In `@core/tests/test_codex_planning_phase.py`:
- Around line 105-111: The test is timing-sensitive because
shutdown_after_first_block uses asyncio.sleep; instead create a shared
asyncio.Event (e.g., blocked) alongside received, set blocked.set() inside the
capture(event: AgentEvent) callback after updating received and
phase_state.planning_ask_rounds (and after calling
_client_input_counts_as_planning_ask), and replace the sleep/wait logic in
shutdown_after_first_block with awaiting blocked.wait() before calling
node.signal_shutdown(); keep references to shutdown_after_first_block,
node.execute(ctx), capture, received, phase_state.planning_ask_rounds and
_client_input_counts_as_planning_ask to locate and update the code accordingly.
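The event-based coordination suggested above can be sketched like this; the names mirror the test but the producer/shutdown coroutines are simplified stand-ins:

```python
# Deterministic coordination sketch: an asyncio.Event set by the capture
# callback replaces a timing-sensitive asyncio.sleep.
import asyncio

async def main():
    blocked = asyncio.Event()
    received = []

    def capture(event):
        received.append(event)
        blocked.set()  # signal as soon as the first event arrives

    async def producer():
        await asyncio.sleep(0)  # simulate the node emitting an event
        capture("planning_ask")

    async def shutdown_after_first_block():
        await blocked.wait()  # no fixed sleep needed
        return "shutdown"

    task = asyncio.create_task(producer())
    result = await shutdown_after_first_block()
    await task
    return received, result

received, result = asyncio.run(main())
```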

In `@core/tests/test_event_loop_node.py`:
- Around line 872-878: The test only asserts both CLIENT_OUTPUT_DELTA and
CLIENT_INPUT_REQUESTED exist but not their ordering; update the test to locate
the indices of the first CLIENT_OUTPUT_DELTA and first CLIENT_INPUT_REQUESTED in
received (e.g. using enumerate and next or a helper) and assert that the
CLIENT_OUTPUT_DELTA index is less than the CLIENT_INPUT_REQUESTED index before
keeping the existing checks on output_events[0].data["snapshot"] and
input_events[0].data["prompt"]; reference the variables received, output_events,
input_events and EventType to find and compare their positions.
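The ordering assertion described above reduces to comparing first-occurrence indices; the event names here are string stand-ins for the real `EventType` values:

```python
# Helper sketch for asserting event ordering via enumerate/next.
def first_index(events, event_type):
    return next(i for i, e in enumerate(events) if e == event_type)

received = [
    "NODE_STARTED",
    "CLIENT_OUTPUT_DELTA",
    "CLIENT_INPUT_REQUESTED",
]

output_idx = first_index(received, "CLIENT_OUTPUT_DELTA")
input_idx = first_index(received, "CLIENT_INPUT_REQUESTED")
assert output_idx < input_idx  # output must precede the input request
```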

In `@core/tests/test_session_manager_worker_handoff.py`:
- Around line 174-177: Replace the flaky fixed sleep with a deterministic wait
that times out: instead of await asyncio.sleep(0.05), use asyncio.wait_for with
a short polling loop that checks the AsyncMock state until consolidate has been
awaited (e.g., poll consolidate.await_count or consolidate.awaited flag) and
raise on timeout; then run consolidate.assert_awaited_once() and assert
queen_node.inject_event.await_count == 0. Target the existing mock names
(consolidate and queen_node.inject_event) so the test reliably waits for the
mocked coroutine to be awaited rather than sleeping.
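The polling-with-timeout pattern suggested above can be sketched with stdlib mocks; `wait_until_awaited` and the background coroutine are illustrative, not the test's real fixtures:

```python
# Replace a fixed sleep with asyncio.wait_for around a short polling loop
# that watches AsyncMock.await_count.
import asyncio
from unittest.mock import AsyncMock

async def wait_until_awaited(mock, timeout=1.0, interval=0.005):
    async def _poll():
        while mock.await_count == 0:
            await asyncio.sleep(interval)
    await asyncio.wait_for(_poll(), timeout)  # raises TimeoutError on hang

async def main():
    consolidate = AsyncMock()
    inject_event = AsyncMock()

    async def background():
        await asyncio.sleep(0.01)
        await consolidate()

    task = asyncio.create_task(background())
    await wait_until_awaited(consolidate)
    await task
    consolidate.assert_awaited_once()
    return consolidate.await_count, inject_event.await_count

counts = asyncio.run(main())
```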

---

Nitpick comments:
In `@core/framework/server/session_manager.py`:
- Around line 211-250: suspend_queen currently only clears
Session.worker_handoff_sub and Session.memory_consolidation_sub but does not
remove queen-local listeners that create_queen registers on the shared EventBus,
causing listener leaks; fix by having create_queen store the subscription IDs it
registers into a new Session attribute (e.g., session.queen_local_subs or
similar) and then modify suspend_queen to iterate over session.queen_local_subs,
call session.event_bus.unsubscribe(...) for each saved id (with the same
try/except pattern used for other subs), and then clear session.queen_local_subs
(set to None or empty) along with the other session fields so all queen-local
listeners are unsubscribed when parking.
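The subscription-tracking fix above can be sketched with simplified stand-ins for `Session` and the shared `EventBus` (the real classes in `session_manager.py` differ):

```python
# Sketch: create_queen records its subscription IDs so suspend_queen can
# release every queen-local listener when parking.
class EventBus:
    def __init__(self):
        self._subs = {}
        self._next = 0

    def subscribe(self, handler):
        self._next += 1
        self._subs[self._next] = handler
        return self._next

    def unsubscribe(self, sub_id):
        self._subs.pop(sub_id, None)

class Session:
    def __init__(self):
        self.event_bus = EventBus()
        self.queen_local_subs = []

def create_queen(session):
    session.queen_local_subs = [
        session.event_bus.subscribe(lambda e: None),
        session.event_bus.subscribe(lambda e: None),
    ]

def suspend_queen(session):
    for sub_id in session.queen_local_subs or []:
        try:
            session.event_bus.unsubscribe(sub_id)
        except Exception:
            pass  # mirror the try/except pattern used for other subs
    session.queen_local_subs = []

session = Session()
create_queen(session)
suspend_queen(session)
```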

In `@core/framework/tools/queen_lifecycle_tools.py`:
- Around line 1059-1078: The helper _get_default_entry_input_keys currently
picks entry_points[0] but runtime.trigger uses entry_point_id="default", so
change the lookup to find the entry spec whose identifier/name equals "default"
(e.g., check attributes like id or name on each spec) and fall back to the first
spec if not found; then derive entry_node_id from that matched entry_spec (or
graph.entry_node if missing) and use graph.get_node to return its input_keys
(preserving existing None/empty guards). This ensures
_get_default_entry_input_keys resolves the same entry point used by
runtime.trigger.
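The "default"-first lookup described above can be sketched as follows; the spec and graph shapes are illustrative `SimpleNamespace` stand-ins, not the framework's real classes:

```python
# Resolve the same entry point runtime.trigger uses (id == "default"),
# falling back to the first spec, then derive input_keys from its node.
from types import SimpleNamespace

def get_default_entry_input_keys(entry_points, graph):
    spec = next(
        (s for s in entry_points if getattr(s, "id", None) == "default"),
        entry_points[0] if entry_points else None,
    )
    if spec is None:
        return []
    node_id = getattr(spec, "entry_node", None) or graph.entry_node
    node = graph.get_node(node_id)
    return list(node.input_keys or []) if node else []

graph = SimpleNamespace(
    entry_node="process",
    get_node=lambda nid: (
        SimpleNamespace(input_keys=["topic"]) if nid == "process" else None
    ),
)
entry_points = [
    SimpleNamespace(id="alt", entry_node="other"),
    SimpleNamespace(id="default", entry_node="process"),
]
keys = get_default_entry_input_keys(entry_points, graph)
```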

In `@core/tests/test_session_manager_worker_validation.py`:
- Around line 78-81: The test currently hard-codes subprocess argument positions
in assertions against captured["cmd"]; instead, search the cmd list for the
validator script name and assert presence and correct surrounding args without
relying on fixed indices. Concretely, in the test that exercises
_run_validation_report_sync(), replace assertions like captured["cmd"][4] and
captured["cmd"][6] with: assert any(arg for arg in captured["cmd"] if
"_validate_agent_package_impl" in arg), then locate that element's index via
captured["cmd"].index(...) and assert the next/expected argument equals
"/tmp/demo_agent" (and assert that the literal
"validate_agent_package(agent_name)" is not present anywhere using not
any(...)). This makes the test robust to reordering while still verifying the
validator is invoked with the correct agent path.
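The index-free assertions suggested above look like this against a captured command; the `cmd` contents here are hypothetical examples, not the real subprocess invocation:

```python
# Search the cmd list for the validator entry point instead of hard-coding
# argument positions, so reordering the command doesn't break the test.
cmd = [
    "uv", "run", "python", "-c",
    "import tool; tool._validate_agent_package_impl(path)",
    "--agent-path", "/tmp/demo_agent",
]

matches = [i for i, arg in enumerate(cmd) if "_validate_agent_package_impl" in arg]
assert matches, "validator invocation missing from cmd"
assert "/tmp/demo_agent" in cmd[matches[0]:]
assert not any("validate_agent_package(agent_name)" in arg for arg in cmd)
```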
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e29641bb-ac20-4d06-a0bd-fd343184ec19

📥 Commits

Reviewing files that changed from the base of the PR and between 86ef6fd and e28d989.

📒 Files selected for processing (37)
  • core/framework/config.py
  • core/framework/graph/conversation.py
  • core/framework/graph/event_loop_node.py
  • core/framework/graph/executor.py
  • core/framework/graph/prompt_composer.py
  • core/framework/llm/codex_adapter.py
  • core/framework/llm/codex_backend.py
  • core/framework/llm/litellm.py
  • core/framework/runner/runner.py
  • core/framework/runtime/event_bus.py
  • core/framework/runtime/execution_stream.py
  • core/framework/server/app.py
  • core/framework/server/queen_orchestrator.py
  • core/framework/server/routes_execution.py
  • core/framework/server/routes_sessions.py
  • core/framework/server/session_manager.py
  • core/framework/server/tests/test_api.py
  • core/framework/tools/queen_lifecycle_tools.py
  • core/framework/tools/worker_monitoring_tools.py
  • core/frontend/src/lib/run-inputs.test.ts
  • core/frontend/src/lib/run-inputs.ts
  • core/frontend/src/pages/workspace.tsx
  • core/tests/test_codex_control_parity.py
  • core/tests/test_codex_parity_gate.py
  • core/tests/test_codex_planning_phase.py
  • core/tests/test_config.py
  • core/tests/test_continuous_conversation.py
  • core/tests/test_event_loop_node.py
  • core/tests/test_litellm_provider.py
  • core/tests/test_queen_lifecycle_validation.py
  • core/tests/test_queen_orchestrator.py
  • core/tests/test_session_manager_worker_handoff.py
  • core/tests/test_session_manager_worker_validation.py
  • core/tests/test_validate_agent_path.py
  • core/tests/test_worker_monitoring_tools.py
  • tools/coder_tools_server.py
  • tools/tests/test_coder_tools_server.py


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (1)
core/framework/tools/queen_lifecycle_tools.py (1)

987-999: ⚠️ Potential issue | 🟠 Major

Stop falling back to unrelated sessions when session_id is known.

If a specific session_id is supplied but its state.json does not exist yet, this still falls back to session_*/state.json and can borrow another run's defaults. That reintroduces cross-session input bleed on fresh runs/reruns.

Suggested fix
     if session_id:
         state_path = sessions_dir / session_id / "state.json"
         candidate_state_paths = [state_path] if state_path.exists() else []
     else:
         candidate_state_paths = []

-    if not candidate_state_paths:
+    if session_id and not candidate_state_paths:
+        return {}
+
+    if not candidate_state_paths:
         candidate_state_paths = sorted(
             sessions_dir.glob("session_*/state.json"),
             key=lambda path: path.stat().st_mtime if path.exists() else 0,
             reverse=True,
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/framework/tools/queen_lifecycle_tools.py` around lines 987 - 999, When a
specific session_id is given the code currently falls back to other sessions'
state.json if that session's state file is missing; stop that by only using the
glob-based fallback when no session_id was provided. Change the fallback
condition so the sorted(sessions_dir.glob("session_*/state.json")) assignment
runs only when session_id is falsy (e.g. replace "if not candidate_state_paths:"
with "if not candidate_state_paths and not session_id:" or equivalent). This
touches the candidate_state_paths variable and the session_id/state.json logic
in queen_lifecycle_tools.py (look for the session_id block and the later
fallback that calls sessions_dir.glob).
🧹 Nitpick comments (9)
core/frontend/src/lib/run-inputs.ts (1)

1-8: Consider extracting QueenPhase to a shared location.

The QueenPhase type is duplicated here and in chat-helpers.ts. If phases change, both need updates. A shared types.ts or barrel export would reduce drift.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/frontend/src/lib/run-inputs.ts` around lines 1 - 8, QueenPhase is
duplicated across modules; extract the type into a single shared module and
import it where needed. Create a shared types file (e.g., export type QueenPhase
= "planning" | "building" | "staging" | "running";), export the QueenPhase type,
then replace the inline declaration in run-inputs.ts (the QueenPhase type) and
the duplicate in chat-helpers.ts to import { QueenPhase } from that shared
module; ensure any existing references (e.g., RUNNABLE_PHASES or functions
expecting QueenPhase) still type-check after the import and update exports if
you need to re-export the type from a barrel.
core/framework/llm/litellm.py (2)

1696-1697: Remove duplicate import.

FinishEvent, TextDeltaEvent, and TextEndEvent are already imported at line 1684-1689 within this function.

♻️ Remove duplicate import
             if not output_text:
                 return []
-            from framework.llm.stream_events import FinishEvent, TextDeltaEvent, TextEndEvent
-
             usage = getattr(response, "usage", None)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/framework/llm/litellm.py` around lines 1696 - 1697, Remove the duplicate
import of FinishEvent, TextDeltaEvent, and TextEndEvent: locate the redundant
"from framework.llm.stream_events import FinishEvent, TextDeltaEvent,
TextEndEvent" inside the same function (the earlier import at the top of that
function already brings these in) and delete this duplicate import so the three
symbols are only imported once.

663-668: Minor: redundant conditional branches for max_tokens.

Both branches of the if not stream / else set max_tokens identically. This could be simplified to a single unconditional assignment.

♻️ Suggested simplification
-        if not stream:
-            kwargs["max_tokens"] = max_tokens
-        else:
-            kwargs["max_tokens"] = max_tokens
-            if not self._is_anthropic_model():
-                kwargs["stream_options"] = {"include_usage": True}
+        kwargs["max_tokens"] = max_tokens
+        if stream and not self._is_anthropic_model():
+            kwargs["stream_options"] = {"include_usage": True}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/framework/llm/litellm.py` around lines 663 - 668, The two-branch
conditional redundantly assigns kwargs["max_tokens"] in litellm.py; simplify by
assigning kwargs["max_tokens"] = max_tokens unconditionally, then keep the
stream-specific logic: if stream is truthy and not self._is_anthropic_model()
add kwargs["stream_options"] = {"include_usage": True}. Update the block around
the existing use of variables stream, max_tokens, kwargs and the call to
self._is_anthropic_model() in the same function to remove the unused if/else.
tools/coder_tools_server.py (2)

2073-2086: In-process import for behavior validation may leak module state across calls.

The behavior validation imports the agent module in-process (lines 2073-2086) after clearing stale entries from sys.modules. While the cache clearing helps, this approach differs from other validation steps that use subprocesses for isolation. If _validate_agent_package_impl is called multiple times in the same process with different agents, the sys.path modification (line 2076) accumulates paths.

This is likely acceptable for the current use case (subprocess-per-validation from session_manager), but consider documenting this limitation or using a subprocess for full isolation if the function is called in-process repeatedly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/coder_tools_server.py` around lines 2073 - 2086, The code in
_validate_agent_package_impl modifies sys.path by inserting package_parent and
clears sys.modules before importlib.import_module(package_name), which can
accumulate package_parent entries if called multiple times; update the function
to either (a) restore sys.path after the import (remove the inserted
package_parent if you added it) to avoid path accumulation and ensure you only
clear the specific sys.modules entries you previously removed, or (b) switch
this validation path to run in a subprocess for full isolation (matching other
validators); also add a short docstring/note on _validate_agent_package_impl
documenting the in-process limitation if you keep the current approach.

477-486: Entry-node intake detection may have false positives for legitimate parsing nodes.

The heuristic at lines 477-486 flags entry nodes that look like "intake/config parsers" based on keyword matching. However, agents that legitimately need to validate/normalize input before doing real work may get blocked. The condition combines intake_like and not direct_work_like with other signals, but terms like "validate", "normalize", "config", "path" (lines 463-475) are common in legitimate work nodes too.

Consider documenting the escape hatch (e.g., adding a direct-work hint to the node description) or relaxing the blocking behavior to a warning for edge cases where real work nodes happen to mention validation.

core/framework/server/tests/test_api.py (4)

663-691: Make terminal-choice test fully deterministic and side-effect explicit.

Using free-form "No, stop here" may couple this unit test to fuzzy matching. Prefer the canonical option string and assert queen.inject_event is not called before queen is parked.

Suggested test hardening
 async def test_chat_done_for_now_parks_queen_without_new_followup(self):
@@
-        async with TestClient(TestServer(app)) as client:
+        queen_node = session.queen_executor.node_registry["queen"]
+        async with TestClient(TestServer(app)) as client:
             resp = await client.post(
                 "/api/sessions/test_agent/chat",
-                json={"message": "No, stop here"},
+                json={"message": "Done for now"},
             )
@@
-        queen_node = session.queen_executor
-        assert queen_node is None
+        queen_node.inject_event.assert_not_awaited()
+        assert session.queen_executor is None
         session.event_bus.emit_client_output_delta.assert_awaited_once()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/framework/server/tests/test_api.py` around lines 663 - 691, The test
test_chat_done_for_now_parks_queen_without_new_followup should use the exact
terminal option string from the event history instead of free-form input and
explicitly assert queen.inject_event is not called before the queen is parked;
update the POST payload to use "Done for now" (matching the options list from
session.event_bus.get_history) and add an assertion that
session.queen_executor.inject_event was not awaited/called prior to verifying
queen_node is None, ensuring the test checks both deterministic input and the
absence of the side-effect on queen.inject_event.

931-943: Assert stop result payload, not just HTTP status.

The test name says stop behavior should continue despite validation failures; asserting stopped in response would validate the functional outcome too.

Suggested assertion addition
         async with TestClient(TestServer(app)) as client:
             resp = await client.post(
                 "/api/sessions/test_agent/stop",
                 json={"execution_id": "exec_abc"},
             )
             assert resp.status == 200
+            data = await resp.json()
+            assert data["stopped"] is True
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/framework/server/tests/test_api.py` around lines 931 - 943, The test
test_stop_ignores_worker_validation_failure currently only asserts HTTP 200;
after posting to "/api/sessions/test_agent/stop" with execution_id "exec_abc"
using TestClient(TestServer(app)), parse the JSON response body and assert the
expected functional outcome (e.g., that the payload contains "stopped": true or
the appropriate field indicating the execution was stopped) so the test verifies
the stop result in addition to resp.status; update the assertion after the POST
accordingly and keep existing setup of session.worker_validation_failures and
session.worker_runtime._mock_streams["default"]._execution_tasks["exec_abc"]
intact.

867-891: Add a no-execution assertion for resume validation blocking.

Like /trigger, /resume should prove that runtime execution is not started when worker validation fails.

Suggested test hardening
 async def test_resume_blocks_invalid_loaded_worker(self, sample_session, tmp_agent_dir):
@@
         session = _make_session(tmp_dir=tmp_path / ".hive" / "agents" / agent_name)
+        session.worker_runtime.trigger = AsyncMock()
         session.worker_validation_report = {
             "valid": False,
             "steps": {"tool_validation": {"passed": False}},
         }
@@
             assert data["validation_failures"] == [
                 "tool_validation: missing tool execute_command_tool"
             ]
+        session.worker_runtime.trigger.assert_not_awaited()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/framework/server/tests/test_api.py` around lines 867 - 891, The test
should also assert that runtime execution is not started when worker validation
fails: mock or spy the function that the resume endpoint would call to begin
execution (e.g., the Session.start_runtime or the app's runtime-start function
used by the "/api/sessions/{agent}/resume" handler) before making the POST, then
after the POST assert that this start function was never called (or that no
runtime task/process was created), in addition to the existing 409 and
validation failure assertions; update test_resume_blocks_invalid_loaded_worker
to include that no-start assertion.

588-606: Also assert trigger is not invoked on validation failure.

This test checks the 409 payload, but it should additionally guard against accidental runtime execution when validation fails.

Suggested test hardening
 async def test_trigger_blocks_invalid_loaded_worker(self):
     session = _make_session()
+    session.worker_runtime.trigger = AsyncMock()
     session.worker_validation_report = {
         "valid": False,
         "steps": {"behavior_validation": {"passed": False}},
     }
@@
             data = await resp.json()
             assert "failed validation" in data["error"]
             assert data["validation_failures"] == ["behavior_validation: placeholder prompt"]
+        session.worker_runtime.trigger.assert_not_awaited()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/framework/server/tests/test_api.py` around lines 588 - 606, The test
test_trigger_blocks_invalid_loaded_worker should also assert that the runtime
trigger path is not executed when validation fails; add a spy/mock (e.g.,
unittest.mock.AsyncMock) on the method that the endpoint would call to run the
agent (attach to the session object used in the test — e.g., session.trigger or
the worker run method used by your codebase) before creating the app, perform
the POST as shown, then assert that the mocked method was not awaited/called;
keep the existing 409 and payload assertions and only add this non-invocation
check to prevent accidental runtime execution.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@core/framework/graph/executor.py`:
- Around line 1477-1480: The call to
continuous_conversation.update_system_prompt currently uses
list(next_spec.output_keys or []) which converts None to an empty list and thus
overwrites prior output keys; change it to preserve None semantics by passing
None when next_spec.output_keys is None and otherwise passing
list(next_spec.output_keys) so that update_system_prompt receives None to mean
"no change" and a list only when keys are explicitly provided.
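A minimal sketch of the None-preserving conversion described above; the function name is illustrative, not the executor's real API:

```python
# None means "no change" to update_system_prompt; only an explicit sequence
# of keys should be converted to a list.
def to_output_keys(output_keys):
    return None if output_keys is None else list(output_keys)
```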

In `@core/framework/tools/queen_lifecycle_tools.py`:
- Around line 4192-4200: The persisted defaults loaded via
_load_recent_worker_input_defaults are being replayed wholesale into input_data,
which can reintroduce stale keys; change the construction of input_data to first
filter the dict returned by _load_recent_worker_input_defaults(runtime,
allowed_keys, session_id=session_id) to only include keys present in
allowed_keys (as computed by _get_default_entry_input_keys(runtime)), then pass
those filtered items through _normalize_worker_input_value before building
input_data; preserve the existing legacy/no-input-keys fallback behavior (do not
change the fallback path) and ensure references remain to allowed_keys,
_load_recent_worker_input_defaults, _normalize_worker_input_value, runtime, and
session_id so only current allowed keys are replayed.
- Around line 1228-1238: In _run_package_validation: do not return None when the
validator is missing, when executor raises, or when _parse_validation_report
returns None; instead return a failure report dict so callers treat it as a
validation failure. Locate registry._tools.get("validate_agent_package"), call
validator.executor({"agent_name": agent_ref}) inside a try/except, await
coroutine results as now, and if validator is None or an exception is raised or
_parse_validation_report(result) yields None, return a dict like {"valid":
False, "errors": [...], "raw_result": result_or_error} (or similar schema used
by callers) so the gate fails closed; keep using
_parse_validation_report(result) for the normal success path.
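The key-filtering behavior in the first `queen_lifecycle_tools.py` comment above can be sketched as follows; `_normalize` stands in for `_normalize_worker_input_value` and the dict shapes are assumptions:

```python
# Replay only keys present in the current entry schema, normalizing each
# value, so stale persisted fields never reach input_data.
def _normalize(value):  # stand-in for _normalize_worker_input_value
    return value.strip() if isinstance(value, str) else value

def build_input_data(persisted_defaults, allowed_keys):
    return {
        key: _normalize(value)
        for key, value in persisted_defaults.items()
        if key in allowed_keys
    }

input_data = build_input_data(
    {"topic": " llms ", "stale_field": "old"},
    allowed_keys={"topic", "depth"},
)
```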
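The fail-closed behavior in the `_run_package_validation` comment above can be sketched synchronously (the real code also awaits coroutine results); the failure-dict schema is an assumption:

```python
# Any missing validator, raised exception, or unparseable report becomes an
# explicit failure dict instead of None, so the gate fails closed.
def run_package_validation(validator, parse_report, agent_ref):
    if validator is None:
        return {"valid": False, "errors": ["validator not registered"],
                "raw_result": None}
    try:
        result = validator({"agent_name": agent_ref})
    except Exception as exc:
        return {"valid": False, "errors": [str(exc)], "raw_result": None}
    report = parse_report(result)
    if report is None:
        return {"valid": False, "errors": ["unparseable validation report"],
                "raw_result": result}
    return report  # normal success path

ok = run_package_validation(lambda args: {"valid": True}, lambda r: r, "demo")
missing = run_package_validation(None, lambda r: r, "demo")
```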


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 274a781f-cf56-4307-8c2b-b13d4022b6c5

📥 Commits

Reviewing files that changed from the base of the PR and between e28d989 and 148f61a.

📒 Files selected for processing (19)
  • core/framework/graph/executor.py
  • core/framework/llm/codex_adapter.py
  • core/framework/llm/codex_backend.py
  • core/framework/llm/litellm.py
  • core/framework/server/routes_execution.py
  • core/framework/server/session_manager.py
  • core/framework/server/tests/test_api.py
  • core/framework/tools/queen_lifecycle_tools.py
  • core/framework/tools/worker_monitoring_tools.py
  • core/frontend/src/lib/run-inputs.test.ts
  • core/frontend/src/lib/run-inputs.ts
  • core/frontend/src/pages/workspace.tsx
  • core/tests/test_config.py
  • core/tests/test_litellm_provider.py
  • core/tests/test_queen_lifecycle_validation.py
  • core/tests/test_session_manager_worker_validation.py
  • core/tests/test_worker_monitoring_tools.py
  • tools/coder_tools_server.py
  • tools/tests/test_coder_tools_server.py
✅ Files skipped from review due to trivial changes (1)
  • tools/tests/test_coder_tools_server.py
🚧 Files skipped from review as they are similar to previous changes (5)
  • core/tests/test_litellm_provider.py
  • core/tests/test_config.py
  • core/tests/test_session_manager_worker_validation.py
  • core/frontend/src/pages/workspace.tsx
  • core/framework/server/session_manager.py


@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (3)
core/tests/test_queen_lifecycle_validation.py (1)

596-601: Unused mock attribute can be removed for clarity.

_get_primary_session_state is mocked but never called since run_agent_with_input now passes session_state=None directly. The mock is harmless but misleading.

🔧 Remove unused mock
     runtime = SimpleNamespace(
         resume_timers=MagicMock(),
         trigger=AsyncMock(),
-        _get_primary_session_state=MagicMock(return_value={}),
         graph=SimpleNamespace(nodes=[]),
     )

And similarly at line 637:

     runtime = SimpleNamespace(
         resume_timers=MagicMock(),
         trigger=AsyncMock(return_value="exec-1"),
-        _get_primary_session_state=MagicMock(return_value={}),
         get_entry_points=lambda: [SimpleNamespace(id="default", entry_node="process")],

Also applies to: 637-637

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/tests/test_queen_lifecycle_validation.py` around lines 596 - 601, The
runtime fixture creation includes an unused mocked attribute
_get_primary_session_state which is no longer called because
run_agent_with_input passes session_state=None; remove the
_get_primary_session_state=MagicMock(...) entry from the SimpleNamespace
initialization in both places (the runtime declarations around the test at lines
~596 and ~637) so the runtime only contains resume_timers, trigger, and graph to
avoid misleading unused mocks.
core/framework/tools/queen_lifecycle_tools.py (2)

2995-3021: Judge pressure detection has a potential edge case with mixed actions.

The streak detection breaks on ACCEPT or any action not in _NON_ACCEPT_JUDGE_ACTIONS, which includes actions not in the expected set. If a new verdict action is added (e.g., "PAUSE"), it would break the streak even though it's not an ACCEPT.

This is likely acceptable since unknown actions are rare, but worth noting for future maintainability.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/framework/tools/queen_lifecycle_tools.py` around lines 2995 - 3021, The
current _get_recent_judge_pressure streak loop breaks on any action not in
_NON_ACCEPT_JUDGE_ACTIONS which causes new/unknown non-ACCEPT actions (e.g.,
"PAUSE") to prematurely end the streak; change the loop in
_get_recent_judge_pressure to only break when action == "ACCEPT" and treat any
other non-empty action as part of the non-ACCEPT streak (append unknown actions
to streak so they count), keeping the existing compression step and final
threshold check; reference the function name _get_recent_judge_pressure and the
constant _NON_ACCEPT_JUDGE_ACTIONS when making this update.

1071-1108: Preflight timeout is appropriate but consider surfacing the warning.

The 15-second timeout with a warning log is reasonable. However, if preflight times out, the worker proceeds without credential validation or MCP resync. The queen won't know about this unless it checks logs.

Consider returning a status flag or including a warning in the trigger response so the queen can inform the user that preflight was incomplete.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/framework/tools/queen_lifecycle_tools.py` around lines 1071 - 1108,
change the _preflight_worker_run signature and behavior to surface when
preflight did not complete: have _preflight_worker_run return a bool (True =
preflight completed, False = preflight incomplete/timed out) and set a
short-lived flag on the session (e.g., session.preflight_incomplete_warning =
True) when the timeout path is hit so callers (the queen/trigger response) can
surface the warning; specifically, update async def
_preflight_worker_run(session, runtime, timeout_seconds) to return bool, set
cred_error handling and MCP resync as-is but in the except TimeoutError block
set session.preflight_incomplete_warning = True and return False, otherwise
return True when _preflight finishes successfully so callers can include the
warning in the trigger response.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d908ef30-3cd3-44ef-a9da-3ece023ede89

📥 Commits

Reviewing files that changed from the base of the PR and between 148f61a and f56600f.

📒 Files selected for processing (2)
  • core/framework/tools/queen_lifecycle_tools.py
  • core/tests/test_queen_lifecycle_validation.py

Collaborator

@RichardTang-Aden RichardTang-Aden left a comment


Reverted the change for orchestrator


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@core/framework/server/queen_orchestrator.py`:
- Around line 33-36: The fallback that computes requires_input is unreachable
because prompt is falsy at that point; update the fallback logic in
queen_orchestrator.py so that when data.get("assistant_text_requires_input") is
None you compute requires_input from data.get("assistant_text_present") only
(i.e., treat assistant_text_present as the indicator) instead of using
data.get("assistant_text_present") and data.get("prompt"); locate the code that
sets requires_input and change the fallback expression accordingly (or
alternatively simplify to default False by using
data.get("assistant_text_requires_input", False) if that behavior is desired).
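The fallback fix described above can be sketched as a small helper. The payload keys come from the review comment; the function wrapper itself is hypothetical, since the real logic is inline in queen_orchestrator.py:

```python
# Hypothetical helper illustrating the suggested fallback: when the explicit
# flag is absent, key off assistant_text_present alone, since prompt is
# already known to be falsy at this point in the orchestrator.
def compute_requires_input(data: dict) -> bool:
    requires_input = data.get("assistant_text_requires_input")
    if requires_input is None:
        requires_input = bool(data.get("assistant_text_present"))
    return bool(requires_input)
```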

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2b123881-9847-470b-8ee9-18333df9ca7e

📥 Commits

Reviewing files that changed from the base of the PR and between f56600f and 220beb5.

📒 Files selected for processing (1)
  • core/framework/server/queen_orchestrator.py


@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (4)
core/framework/llm/litellm.py (4)

1720-1720: Remove duplicate import.

FinishEvent, TextDeltaEvent, and TextEndEvent are already imported at Lines 1708-1713. This inner import is redundant.

Remove duplicate import
             if not output_text:
                 return []
-            from framework.llm.stream_events import FinishEvent, TextDeltaEvent, TextEndEvent
-
             usage = getattr(response, "usage", None)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/framework/llm/litellm.py` at line 1720, There is a redundant inner
import of FinishEvent, TextDeltaEvent, and TextEndEvent in litellm.py; remove
the duplicated import statement (the one importing FinishEvent, TextDeltaEvent,
TextEndEvent from framework.llm.stream_events around the inner block) since
those symbols are already imported earlier (lines importing
FinishEvent/TextDeltaEvent/TextEndEvent), leaving only the original top-level
import intact.

685-690: Simplify redundant conditional.

Both branches set max_tokens identically. The stream_options logic can be expressed more concisely.

Simplify max_tokens and stream_options assignment
-        if not stream:
-            kwargs["max_tokens"] = max_tokens
-        else:
-            kwargs["max_tokens"] = max_tokens
-            if not self._is_anthropic_model():
-                kwargs["stream_options"] = {"include_usage": True}
+        kwargs["max_tokens"] = max_tokens
+        if stream and not self._is_anthropic_model():
+            kwargs["stream_options"] = {"include_usage": True}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/framework/llm/litellm.py` around lines 685 - 690, The code sets
kwargs["max_tokens"] in both branches redundantly; simplify by setting
kwargs["max_tokens"] = max_tokens once, then only add kwargs["stream_options"] =
{"include_usage": True} when stream is truthy and self._is_anthropic_model() is
false. Locate the block around the use of stream, kwargs, and the call to
self._is_anthropic_model() in litellm.py and replace the two-branch assignment
with a single max_tokens assignment and a single conditional that adds
stream_options only for non-Anthropic models when streaming.

1-35: Missing from __future__ import annotations import.

The coding guidelines require from __future__ import annotations for modern type syntax. This file uses type hints extensively but lacks this import.

Add future annotations import
 """LiteLLM provider for pluggable multi-provider LLM support.
 ...
 """
 
+from __future__ import annotations
+
 import ast
 import asyncio

As per coding guidelines: "Use from __future__ import annotations for modern type syntax".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/framework/llm/litellm.py` around lines 1 - 35, Add the future
annotations import as the very first statement in the module: include "from
__future__ import annotations" at the top of core/framework/llm/litellm.py
before any other imports so forward references and postponed evaluation of type
hints work correctly for functions/classes like LLMProvider, LLMResponse, Tool
and any async iterator uses; ensure it remains the first non-comment line in the
file.

551-553: _codex_backend is a snapshot, not a live alias.

_codex_adapter.enabled is a @property that dynamically evaluates is_codex_api_base(self._provider.api_base). However, Line 553 captures this as a one-time boolean value at construction. If api_base were ever modified post-init, _codex_backend would become stale while _codex_adapter.enabled would reflect the update.

If this is intentional for backward compatibility with callers that expect a simple bool, consider adding a brief comment clarifying the snapshot semantics.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/framework/llm/litellm.py` around lines 551 - 553, The field
_codex_backend is currently assigned a one-time boolean from
CodexResponsesAdapter(self).enabled which snapshots
is_codex_api_base(self._provider.api_base) at construction and will go stale if
api_base changes; either make _codex_backend a live proxy (replace the snapshot
with a property or lambda that returns self._codex_adapter.enabled) or, if
snapshot semantics are intentional for backward compatibility, add a short
clarifying comment next to the assignment explaining that _codex_backend is a
construction-time snapshot and will not reflect subsequent changes to
_codex_adapter.enabled/_provider.api_base.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e55b49aa-3fe0-4956-abe0-69987f951d38

📥 Commits

Reviewing files that changed from the base of the PR and between 220beb5 and 29e85a1.

📒 Files selected for processing (3)
  • core/framework/llm/litellm.py
  • core/framework/runtime/execution_stream.py
  • core/tests/test_litellm_provider.py
✅ Files skipped from review due to trivial changes (1)
  • core/tests/test_litellm_provider.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • core/framework/runtime/execution_stream.py
